(54) Prediction and processing of failures in storage subsystem



(57) Predictive failure analysis of a storage subsys- 
tem is efficiently conducted and data quickly recovered 
from a failed Read operation. This may be implemented 
in a storage system including a host coupled to a super- 
vising processor that couples to a parity-equipped RAID 
storage subsystem having multiple HDAs each includ- 
ing an HDA controller and at least one storage medium. 
In one embodiment, when an HDA experiences an error 
during a Read attempt, the HDA transmits a recovery 
alert signal to the supervising processor; then, the proc- 
essor and HDA begin remote and local recovery proc- 
esses in parallel. The first process to complete provides 
the data to the host and the second process is aborted. 
In another embodiment, an HDA's PFA operations are 
restricted to idle times of the HDA. A different embodi- 
ment limits HDA performance of PFA to times when the 
processor is conducting data reconstruction. Another 
embodiment monitors HDA errors at the supervisory 
processor level, initiating an HDA's PFA operations 
when errors at that HDA have a certain characteristic, 
such as a predetermined frequency of occurrence. 
Description

The present invention relates to the prediction and/ 
or processing of failures in digital data storage systems. 
More particularly, the invention concerns a method and 
apparatus for efficiently conducting predictive failure 
analysis of a storage subsystem and for more quickly 
providing an output of data after a failed Read operation. 

Generally, a digital data storage subsystem is an 
assembly of one or more storage devices that store data 
on storage media such as magnetic or optical data stor- 
age disks. In magnetic disk storage systems, a storage 
device is called a head disk assembly ("HDA"), which
includes one or more storage disks and an HDA control- 
ler to manage local operations concerning the disks. 

A number of known storage subsystems incorpo- 
rate certain techniques and devices to predict storage 
device failures, along with other techniques and devices 
to quickly recover from device failures. As discussed be- 
low, however, these systems may not be completely ad- 
equate for use in certain applications. 

Predictive Failure Analysis 

A number of known storage subsystems employ 
predictive failure analysis ("PFA") to enhance their stor- 
age operations. PFA, which generally serves to detect 
symptoms indicative of an impending storage failure, 
may be implemented in a number of different ways. In 
a typical storage subsystem, the HDA electronics con- 
duct PFA operations for associated storage media, 
among other functions. Typically, when an HDA detects 
an impending error during PFA operations, the PFA rou- 
tine notifies the storage subsystem of the impending fail- 
ure. 

Although PFA operations are often useful in recog- 
nizing impending storage subsystem failures, they may 
impede ongoing storage tasks of the HDAs. In particular,
a typical PFA routine may require several hundred mil- 
liseconds to complete. Depending upon the particular 
design of the HDA, during PFA operations the HDA may
be (1) capable of conducting limited Read or Write operations,
at best, or (2) unavailable for processing any
Read or Write operations, at worst. Some applications 
may be unable to bear such impairments to perform- 
ance of the HDA's data storage and retrieval functions, 
albeit temporary. 

Data Recovery and Reconstruction 

When an HDA fails due to an error occurring in a
storage device and a user, application program, or other 
process requests data from the HDA, some attempt 
must be made to provide the requested data in spite of 
the storage device failure. This process, called "data recovery",
involves determining the contents of the requested
unavailable data and providing the data as an
output of the HDA. In many cases, recovery includes
two components: data "retry" and data "reconstruction."

Data retry involves the HDA controller of the failed 
storage device executing a prescribed data retry routine 
having a finite number of "retry" steps. For example, the 
HDA may perform multiple attempts to recover failed data
while varying certain parameters to possibly improve
the chances of recovering the data. Since each retry requires
at least one disk rotation, and the entire recovery
procedure can require multiple rotations, the retry process
may consume a significant amount of time before
finally recovering the data.

In contrast to data retry, data "reconstruction" in- 
volves the process of reproducing data of the failed stor- 
age device using data from other sources and stored 
parity computations. For a more detailed explanation of
various reconstruction schemes, reference is made to
The RAIDbook: A Source Book for Disk Array Technology,
Fourth Edition (August 6, 1994), published by The
RAID Advisory Board, St. Peter, MN. As is known, RAID
versions subsequent to RAID-0 employ parity to enhance
data reliability.

Some known storage systems employ a two-step 
data recovery procedure. After the HDA unsuccessfully 
exhausts its retry attempts (first step), the HDA requests 
assistance from a supervising processor that oversees
operations of the multiple HDAs in the storage system.
The supervising processor then employs data reconstruction
techniques, such as parity reconstruction, to
recreate the otherwise lost data (second step). Even in
RAID systems, however, two-step data recovery may be
unsatisfactory for some applications because it is too
time consuming. Not only might an unsuccessful HDA
retry routine require considerable time to complete on
the HDA level, but the data reconstruction process performed
at the supervising processor level may add a significant
delay of its own.

The present invention encompasses a number of 
different aspects which relate to predictive failure anal- 
ysis of a storage subsystem and/or recovery of data 
from a failed Read operation. The hardware environment
of the system for one or more of the aspects may comprise
a storage subsystem including a host coupled to
a supervising processor that couples to a parity-equipped
RAID storage system having multiple HDAs,
each HDA including an HDA controller and at least one
storage medium. 

According to a first aspect, when an HDA experiences
an error during a Read attempt, the HDA transmits a
"recovery alert" signal to the supervising processor. After
transmission of this signal, the processor and HDA
begin remote and local recovery processes in parallel.
In particular, the processor performs data reconstruction
while the HDA performs data retry. The first process to
complete provides the data to the host, and the second
process is aborted.

According to a second aspect, an HDA's PFA operations
are restricted to the HDA's "idle" times, i.e. periods
of time beginning when there has been no storage
access for a predetermined period of time.

According to a third aspect, the HDA performance
of PFA is limited to times when the processor is conducting
data reconstruction, to completely avoid any HDA
"down time" due to PFA exercises.

According to a fourth aspect, HDA errors are mon- 
itored at the supervisory processor level, and an HDA's 
PFA operations are initiated when errors having a certain
characteristic are detected at the HDA. This characteristic,
for example, may involve the occurrence of a
predetermined number of errors within a certain time period,
or the occurrence of a number of errors in a specific
range of tracks within a certain time period.

The present invention also encompasses a data 
storage medium tangibly embodying a machine-reada- 
ble program to perform the method steps of each of the 
aforementioned methods. 

Furthermore, the present invention encompasses a 
data storage subsystem for performing each of the 
aforementioned methods. 

Embodiments of the invention will now be de- 
scribed, by way of example only, with reference to the 
accompanying drawings, in which like reference numer- 
als designate like parts throughout, wherein: 

FIGURE 1 is a block diagram of exemplary hardware
components for implementing the present invention;

FIGURE 2 is a flowchart depicting a sequence for 
data reconstruction using a recovery alert tech- 
nique pursuant to one embodiment of the invention; 

FIGURE 3 is a flowchart depicting a sequence for
efficient PFA performance by idle time PFA restric- 
tion, pursuant to a second embodiment of the inven- 
tion; 

FIGURE 4 is a flowchart depicting a sequence for 
efficient PFA performance by performing PFA only 
in parallel with data reconstruction, pursuant to a 
third embodiment of the invention; 

FIGURE 5 is a flowchart depicting a sequence for 
efficient PFA performance by triggering PFA upon 
high-level error monitoring, pursuant to a fourth em- 
bodiment of the invention; and 

FIGURE 6 is an illustrative data storage medium on 
which may be stored program instructions for im- 
plementing the various embodiments of the inven- 
tion. 

As shown by the example of Figure 1, the hardware
components and interconnections of the invention may
include a data storage system 100 that includes a host
102 and a storage subsystem 101. The host 102 may
comprise, for example, a PC, workstation, mainframe
computer, or another suitable host. The storage subsystem
101 may be embodied in an IBM brand RAMAC array
subsystem, for example.

The storage subsystem 101 includes a supervisory
processor 104 coupled to a plurality of HDAs 108-113.
The host 102 and processor 104 exchange commands
and data, as discussed in greater detail below. The processor
104 preferably comprises a microprocessor such
as the INTEL model i960™. Each of the HDAs 108-113
is accessible via a storage interface 105. In this regard,
the interface 105 may comprise an apparatus employing
serial storage architecture (known as "SSA"), for example.
In the illustrated example, each HDA 108-113 comprises
a magnetic storage disk such as a "hard drive."
However, in certain applications each HDA 108-113 may
comprise a number of different devices, such as optical
storage disks, optical or magnetic tape media, RAM, etc.

For use in some or all of the operational embodiments
described below, it is preferred that the HDAs
108-113 are operated as a parity-equipped RAID subsystem.
For example, the well known RAID-5 protocol
may be used, in which case the supervisory processor
104 comprises a RAID controller.

In the illustrated embodiment, the HDAs 108-113
are identical, each including a number of components.
The HDA 108, for instance, includes an HDA controller
115, an armature 122 connected to the HDA controller
115, and one or more storage media 127, which comprise
magnetic storage disks in the present example.
Each HDA controller 115-118 may be embodied in a different
ASIC, for example.

In the preferred embodiment, the supervisory processor
104 manages operation of the storage subsystem
101 by executing a series of computer-readable programming
instructions. These programming instructions
may comprise, for example, lines of C++ code.
These programming instructions may be contained in a
memory 106, which preferably comprises a RAM module,
but may instead comprise an EPROM, PLA, ECL,
or another suitable storage medium. With respect to the
supervisory processor 104, the memory 106 may be
stand-alone or incorporated within the supervisory processor
104. Alternatively, the programming instructions
may be contained on a data storage medium external
to the supervisory processor 104, such as a computer
diskette 600 (Figure 6). Or, the instructions may also
be contained on a DASD array, magnetic tape, conventional
"hard disk drive", electronic read-only memory,
optical storage device, set of paper "punch" cards, or
another data storage medium. In still another alternative,
the programming instructions may be contained in
a reserved space of the storage subsystem 101, such
as in a private file system space.

The computer-readable instructions performed by
the supervisory processor 104 may be further understood
with reference to the detailed description of the
operation of various embodiments, set forth below.

In addition to the hardware aspect described above,
this invention contemplates a method aspect involving
various processes for operating a storage subsystem.
Generally, the storage subsystem is operated to efficiently
conduct predictive failure analysis of a storage
subsystem and also to quickly recover data from a failed
Read operation, as shown in the following description.

Recovery Alert 

Figure 2 depicts a sequence of tasks 200 that illus- 
trate one embodiment of the invention's operation. In 
this embodiment, when an HDA experiences an error 
during a read attempt, the HDA transmits a recovery 
alert signal to the supervising processor. Then, the proc- 
essor and HDA begin remote and local data recovery 
processes in parallel. The first process to complete pro- 
vides the data to the host, and the second process is 
aborted. 

More particularly, after the routine 200 begins in 
task 202, the processor 104 receives a Read request in 
task 204. Although this request originates from the host 
102 in this example, the request may alternatively orig- 
inate from another source, such as a user (not shown) 
when the user submits a request via a user interface 
device (not shown). After receiving the request, the 
processor 104 in task 204 issues a Read command to 
one or more of the HDAs 108-113, as appropriate to the
Read request.

Subsequently, one of the HDAs 108-113 in task 206
experiences a Read failure when attempting to carry out
the Read command, and promptly issues a "Recovery
Alert" signal to the processor 104. After this signal is
transmitted, two recovery processes are initiated in parallel.
Namely:

1. The HDA in task 208 begins a local retry process.

2. Simultaneously, the processor 104 initiates data
reconstruction in task 210. In the illustrated embodiment,
the processor 104 in task 210 orchestrates
reconstruction of the unavailable data using RAID
reconstruction techniques. This may involve, for example,
applying an exclusive-OR operation to (1)
data that corresponds to the failed data and is
present in the remaining (non-failed) HDAs, and (2)
parity bits that are stored in the HDAs and correspond
to the failed data.
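The exclusive-OR reconstruction described in item 2 may be illustrated by the following minimal sketch. It is offered by way of example only and is not the implementation of the processor 104; the function names `xor_blocks` and `reconstruct_block`, and the four-byte sample blocks, are hypothetical and assume a single parity block computed as the bytewise XOR of equally sized data blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the i-th byte of every block into one tuple.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct_block(surviving_data, parity):
    """Rebuild the one missing block from the surviving blocks and parity."""
    return xor_blocks(list(surviving_data) + [parity])


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Suppose the HDA holding data[1] fails; rebuild it from the others.
recovered = reconstruct_block([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the parity with every surviving block cancels their contributions and leaves only the missing block.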

In query 212, the processor 104 asks whether either
of tasks 208 or 210 has completed. If not, tasks 208
and 210 are permitted to continue in task 214. However,
when the first one of the tasks 208/210 completes, the
processor 104 in task 216 receives the recovered data
produced by that task, and provides the data to the requesting
source (e.g. the host 102 or user).

After task 216, the processor 104 aborts the slower 
one of tasks 208/210 in task 218. Thus, data recovery
is performed as quickly as possible, since recovered data
is supplied from the faster of tasks 208 and 210. The
sequence 200 ends in task 220.
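The parallel recovery of sequence 200 may be sketched as follows, by way of example only. The sketch assumes that each recovery routine is a callable that periodically checks an abort flag; the names `run_recovery_race` and `make_routine`, the simulated step timings, and the use of threads are hypothetical stand-ins for the HDA retry (task 208) and the processor-level reconstruction (task 210).

```python
import queue
import threading


def run_recovery_race(local_retry, remote_reconstruct):
    """Run both routines in parallel; return (name, data) of the first success."""
    results = queue.Queue()
    abort = threading.Event()

    def worker(name, routine):
        try:
            data = routine(abort)      # a routine returns None if it gives up
        except Exception:
            data = None
        results.put((name, data))

    threads = [
        threading.Thread(target=worker, args=("retry", local_retry)),
        threading.Thread(target=worker, args=("reconstruction", remote_reconstruct)),
    ]
    for thread in threads:
        thread.start()

    winner = None
    for _ in threads:
        name, data = results.get()     # blocks until a routine reports back
        if data is not None:
            winner = (name, data)      # task 216: first completed recovery
            break
    abort.set()                        # task 218: abort the slower routine
    for thread in threads:
        thread.join()
    if winner is None:
        raise RuntimeError("both recovery processes failed")
    return winner


def make_routine(steps, step_seconds, payload):
    """Build a simulated routine that takes `steps` waits unless aborted."""
    def routine(abort):
        for _ in range(steps):
            if abort.wait(step_seconds):   # True means the abort flag was set
                return None
        return payload
    return routine


slow_retry = make_routine(steps=50, step_seconds=0.01, payload=b"from retry")
fast_rebuild = make_routine(steps=2, step_seconds=0.01, payload=b"from parity")
winner, data = run_recovery_race(slow_retry, fast_rebuild)
```

In this simulated run the reconstruction routine needs far fewer steps than the retry routine, so its data is returned and the retry is told to stop at its next check of the abort flag.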

Idle Time PFA Restriction

Figure 3 depicts a sequence of tasks 300 that illus- 
trate another embodiment of the invention's operation. 
Broadly, this embodiment restricts an HDA's PFA operations
to idle times of the HDA. The sequence 300 may
be performed separately for each one of the HDAs 
108-113. To provide an example, the following discus- 
sion concerns performance of the sequence 300 for the 
HDA 108. 

After the routine 300 begins in task 302, the HDA
controller 115 in query 304 determines whether the HDA
108 is "busy" or "free." The HDA 108 is "busy" when it
is processing an access to data of its storage media 127.
If the HDA 108 is free, the processor 104 in query 306
asks whether the HDA 108 has been free for more than
a predetermined time. This predetermined time, which
may be about 100 ms for example, establishes the
length of time deemed as "idle" for the HDA 108. If the
HDA 108 has been free for the predetermined time period,
the processor 104 in task 310 instructs the HDA
controller 115 to perform a PFA routine. The PFA routine,
for example, may be embodied in microcode contained
in memory of the HDA 108.

The HDA controller 115 continues its local PFA in
query 312 and task 314 until a data access request is
received from the host 102 via the processor 104. At this
point, the processor 104 in task 316 instructs the HDA
controller 115 to abort its local PFA, and control returns
to query 304. As an alternative to steps 312, 314, and
316, the HDA controller 115 may be permitted to complete
its local PFA in spite of any data access requests
that may occur.

In contrast to the progression described above,
control passes to query 308 if query 304 determines that
the HDA 108 is busy, or if query 306 determines that the
HDA 108 has not been free for the predetermined time.
In query 308, the processor 104 determines whether the
HDA 108 has been busy for a second predetermined
time period. This second predetermined time period establishes
the maximum length of time that the HDA can
operate without conducting its PFA routine, regardless
of the occurrence of any data access requests. Thus, if
the HDA 108 has not yet been busy for the second predetermined
time period, the processor in query 308
routes control back to query 304. Otherwise, however,
the processor 104 advances to task 310 and progresses
as described above.
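The decisions made in queries 304, 306 and 308 may be summarized by the following sketch, given by way of example only. The names `HdaState` and `should_start_pfa` are hypothetical. The 100 ms idle threshold follows the example given above; the 60 second value for the second predetermined time is purely an assumed figure, and measuring that period from the most recent PFA run is one possible reading of the sequence 300.

```python
from dataclasses import dataclass


@dataclass
class HdaState:
    busy: bool           # True while the HDA is processing a data access
    last_access: float   # time of the most recent data access, in seconds
    last_pfa: float      # time the PFA routine last ran, in seconds


def should_start_pfa(state, now, idle_threshold=0.100, max_pfa_gap=60.0):
    """Decide whether the HDA should begin its PFA routine at time `now`."""
    # Queries 304 and 306: the HDA is free and has been free long enough
    # to be deemed idle.
    if not state.busy and (now - state.last_access) > idle_threshold:
        return True
    # Query 308: PFA has been deferred for the maximum allowed period
    # (assumed here to be measured from the last PFA run), so it is
    # performed regardless of pending data accesses.
    return (now - state.last_pfa) >= max_pfa_gap


idle_hda = HdaState(busy=False, last_access=10.0, last_pfa=5.0)
busy_hda = HdaState(busy=True, last_access=10.0, last_pfa=5.0)

start_when_idle = should_start_pfa(idle_hda, now=10.2)     # free for 200 ms
start_when_busy = should_start_pfa(busy_hda, now=10.2)     # busy, PFA recent
start_when_overdue = should_start_pfa(busy_hda, now=70.0)  # busy, PFA overdue
```

The second test exists so that a continuously busy HDA is not left without PFA indefinitely.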

PFA and Data Reconstruction in Parallel 


Figure 4 depicts a sequence of tasks 400 that illus- 
trate another embodiment of the invention's operation. 
In this embodiment, HDA performance of local PFA operations
is limited to times when the processor is conducting
data reconstruction. More particularly, after the
routine 400 begins in task 402, the processor 104 re- 
ceives a Read request in task 404. As in the examples 
described above, this request originates from the host 
102, a user, an application program, or another process. 
After receiving the request, the processor 104 in task 
404 issues a Read command to one or more of the 
HDAs 108-113. 

Subsequently, one of the HDAs 108-113 in task 406 
experiences a Read failure when attempting to carry out 
the Read command. In response to this failure, the proc- 
essor 104 initiates two sequences in parallel. 
Namely: 

1. On the HDA level, the failed HDA initiates a local
PFA routine in task 408.

2. On the supervisory processor level, the proces- 
sor 104 initiates data reconstruction. In the illustrat- 
ed embodiment, the processor 104 in task 410 or- 
chestrates reconstruction of the unavailable data 
using RAID reconstruction techniques. This may in- 
volve techniques as described above. 

Thus, the local PFA routine does not impede the
normal operation of the failed HDA 108. Namely, the
PFA routine is performed during a period when the failed
HDA 108 would be inactive nonetheless, that is, while the
processor performs data reconstruction to reproduce
data from the failed HDA 108.

After task 410, query 412 asks whether the processor
104 has finished reconstructing the data. If not, reconstruction
continues in task 414. Otherwise, having
completed reconstruction, the processor 104 in task 416
provides an output of the requested data to the host 102,
user, or other requesting source. The sequence 400
ends in task 418.

High-Level Error Monitoring Triggering PFA 

Figure 5 depicts a sequence of tasks 500 that illus- 
trate another embodiment of the invention's operation. 
In this embodiment, HDA errors are monitored at the su- 
pervisory processor level. The supervisory processor 
104 initiates an HDA's PFA operations when errors at 
that HDA have a certain characteristic, such as a pre- 
determined frequency of occurrence. 

More particularly, after the routine 500 begins in 
task 502, the processor in task 504 receives notice of 
any data access errors occurring in the HDAs 108-113. 
Such data access errors, for example, may comprise 
failures of the storage media 127-130, data check errors,
"seek errors" (e.g. failure of an HDA controller
115-118 to properly align its armature 122-125 to desired
data), and the like. In task 506, the processor 104
records each data access error in an error log. Preferably,
separate error logs are maintained for each one of
the HDAs 108-113, although all errors may be kept in a
common log instead. Therefore, tasks 504 and 506 together
supplement an error log to reflect all errors that
occur in the HDAs 108-113 that are reported to the processor
104.

In parallel with tasks 504 and 506, the system 100
in task 508 continues to conduct normal HDA operations,
such as Read and Write operations. Alternatively,
tasks 504 and 506 may be conducted on an interrupt or
other appropriate basis, rather than being performed in
parallel with task 508.

From time to time, the processor 104 determines in
query 510 whether it is time to evaluate the error logs
for the HDAs 108-113. Such evaluation may be triggered
based upon a number of different events, such as
expiration of a predetermined time period, addition of a
predetermined number of errors to an HDA's error log,
etc. If the processor 104 determines that evaluation is
not yet warranted, normal operations are continued in
tasks 512 and then 508.

When evaluation time arrives, the processor 104 in
task 514 evaluates the error log(s). In particular, the
processor 104 conducts a remote PFA routine to detect
trends and dangerous characteristics indicative of an
impending HDA failure. Such characteristics, for example,
may be the occurrence of a number of errors within
a certain time, or the occurrence of a number of errors
within a certain range of tracks of a storage medium within
a certain time.

If the processor 104 in query 516 finds that this evaluation
lacks features indicative of an impending failure,
normal HDA operations are continued in tasks 512 and
then 508. If, however, signs of an upcoming failure are
found, the processor 104 in task 518 instructs the suspect
HDA to initiate a local PFA routine. Then, normal
HDA operations are continued in tasks 512 and 508.
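The error log evaluation of task 514 and query 516 may be illustrated by the following sketch, by way of example only. The function name `hdas_needing_pfa`, the tuple layout of the log entries, and all thresholds and sample values are hypothetical. The sketch covers the two example characteristics mentioned above: a number of errors within a certain time, optionally restricted to a certain range of tracks.

```python
from collections import Counter


def hdas_needing_pfa(error_log, now, window, max_errors, track_range=None):
    """Return the sorted ids of HDAs whose recent errors exceed max_errors.

    error_log   -- iterable of (hda_id, timestamp, track) tuples
    window      -- how far back to look from `now`, in seconds
    track_range -- optional (low, high) inclusive range of tracks to count
    """
    counts = Counter()
    for hda_id, timestamp, track in error_log:
        if now - timestamp > window:
            continue                   # error is older than the time window
        if track_range is not None:
            low, high = track_range
            if not (low <= track <= high):
                continue               # error lies outside the tracks of interest
        counts[hda_id] += 1
    return sorted(hda for hda, n in counts.items() if n > max_errors)


# Hypothetical common error log: (HDA reference numeral, time, track).
log = [
    (108, 100.0, 12), (108, 101.0, 13), (108, 102.5, 12), (108, 103.0, 14),
    (109, 50.0, 700), (109, 102.0, 701),
    (110, 103.5, 5),
]

# HDAs with more than three errors in the last ten seconds.
suspects = hdas_needing_pfa(log, now=104.0, window=10.0, max_errors=3)
```

Each HDA returned by such an evaluation would then be instructed to initiate its local PFA routine, as in task 518.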

Thus have been described systems and methods
which afford the user a number of distinct advantages.
First, increased access is provided to data stored
in HDAs, since HDA performance of local PFA routines
is selectively limited. Additionally, one embodiment of
the invention provides faster data recovery, since processor-level
and HDA-level recovery procedures are initiated
in parallel.

While there have been shown what are presently
considered to be preferred embodiments of the invention,
it will be apparent to those skilled in the art that
various changes and modifications can be made herein
without departing from the scope of the invention as defined
by the appended claims.



Claims 

1. A method for data recovery in a storage system including
a supervising processor (104) coupled to a
parity-equipped RAID storage subsystem (101)
having multiple head disk assemblies ("HDA")
(108...113) each HDA including an HDA controller
(e.g. 115) and at least one storage medium (e.g. 127),
said method comprising the steps of:

the supervising processor receiving a Read request
for reading target data;

the supervising processor directing a first HDA
to read the target data;

the first HDA attempting to read the target data
and detecting a data error during the attempt;

the first HDA transmitting a recovery alert signal
indicative of the data error to the supervising
processor;

the first HDA initiating a retry process to provide
an output of the target data;

the supervising processor initiating a reconstruction
process concurrently with the first recovery
process to provide an output of the target
data by combining supplementary data and
stored parity, said supplementary data comprising
data corresponding to the target data
and stored elsewhere in the RAID storage subsystem
than the first HDA, and said stored parity
comprising parity corresponding to the target
data and supplementary data and stored in the
RAID storage subsystem; and

determining which of the retry and reconstruction
processes first completes;

providing an output of target data from the first
completing process in response to the Read request;
and

aborting the process not completing first.

2. The method of claim 1, the reconstruction process
including steps of applying an exclusive-OR operation
to the supplementary data and stored parity.


3. The method of claim 1 or claim 2, wherein the RAID 
storage subsystem includes a spare HDA, and the 
method further comprises a step of rebuilding the 
target data upon the spare HDA. 


4. A method for operating a storage system including 
a supervising processor (104) coupled to at least 
one head disk assembly ("HDA") (108) each HDA
including an HDA controller (115) and at least one 
storage medium (127), wherein the processor ac- 
cesses the at least one HDA at selected times to 
exchange data therewith, said method comprising 
the steps of: 



a first one of the at least one HDA determining 
whether a first predetermined time has elapsed 
since a most recent access of the first HDA by 
the processor; 

on a determination that the first predetermined 
time has elapsed, the first HDA performing a 
selected predictive failure analysis ("PFA") to 
predict future failure of the at least one storage 
medium of the first HDA. 

5. The method of claim 4, further comprising the steps
of: 

the first HDA determining whether a second 
predetermined time has elapsed since a most 
recent performance of PFA by the first HDA; 
and 

if the second predetermined time has elapsed, 
the first HDA performing a selected PFA to pre- 
dict future failure of the at least one storage me- 
dium of the first HDA. 

6. The method of claim 4, the step of the first HDA performing
a selected PFA further including the steps
of identifying potential causes of the predicted fu- 
ture failure. 

7. The method of claim 4, the step of the first HDA performing
a selected PFA further comprising the steps
of: 

in response to any access by the processor 
of the first HDA to exchange data therewith, abort- 
ing the first HDA's performance of the selected PFA. 

8. A method for data recovery in a storage system in- 
cluding a host (102) coupled to a supervising proc- 
essor (104) coupled to a parity-equipped RAID stor- 
age subsystem having multiple head disk assem- 
blies ("HDA") each HDA including an HDA controller 
and at least one storage medium, said method com- 
prising the steps of: 

the supervising processor receiving a Read re- 
quest for reading target data; 

the supervising processor directing a first HDA 
to read target data; 

the first HDA attempting to read the target data 
and detecting a data error during the attempt; 

the supervising processor executing a recovery 
process to reconstruct the target data by com- 
bining supplementary data and stored parity, 
said supplementary data comprising data corresponding
to the target data and stored elsewhere
in the RAID storage system than the first
HDA, and said stored parity comprising parity 
corresponding to the target data and supple- 
mentary data and stored in the RAID storage 
subsystem; 

concurrently with the supervising processor executing
the recovery process, the first HDA performing
a selected predictive failure analysis
("PFA") to predict future failure of the at least
one storage medium of the first HDA; and

after completion of the recovery process, providing
reconstructed target data from the supervising
processor to the host.



9. The method of claim 8, further comprising the steps
of:

determining whether the recovery process or
PFA completes first; and

aborting the PFA in the event the recovery process
completes first.

10. The method of claim 8 or claim 9, the step of the
first HDA performing a selected PFA further including
the steps of identifying potential causes of the
predicted future failure.

11. A method for operating a storage system including
a supervising processor coupled to a storage subsystem
having multiple head disk assemblies
("HDA") each including an HDA controller and at
least one storage medium, said method comprising
the steps of:

the supervising processor receiving notice of
predetermined types of data access errors occurring
in the storage subsystem;

the supervising processor recording representations
of the errors in an error log; and

for each HDA, the supervising processor performing
steps comprising:

performing a first predictive failure analysis
("PFA") to determine whether errors associated
with the HDA have a selected characteristic;
and

if the errors associated with the HDA have the
selected characteristic, directing the HDA to
perform a second PFA to predict future failure
of the at least one storage medium of the HDA.

12. The method of claim 11, the selected characteristic
being that errors associated with the HDA and occurring
within a predetermined period of time exceed
a predetermined numerical count.

13. The method of claim 11 or claim 12, the selected
characteristic being that errors associated with the
HDA and occurring within a predetermined range of
physical storage locations on the at least one storage
medium exceed a predetermined numerical
count.

14. The method of claim 11, the step of the supervising
processor performing the second PFA further including
the steps of identifying potential causes of
the predicted future failure.



15. The method of claim 11, the predetermined types of
data access errors at the HDAs including seek er- 
rors. 


16. The method of claim 11, the predetermined types of
data access errors at the HDAs including storage 
media failures. 

17. A data storage medium tangibly embodying a machine-readable
program of instructions to perform
method steps for recovery in a storage system that
includes a supervising processor coupled to a parity-equipped
RAID storage subsystem having multiple
head disk assemblies ("HDA") each including
an HDA controller and at least one storage medium,
said method steps comprising:

the supervising processor receiving a Read re- 
35 quest for reading target data; 

the supervising processor directing a first HDA 
to read the target data; 

40 the first HDA attempting to read the target data 

and detecting a data error during the attempt; 

the first HDA transmitting a recovery alert sig- 
nal indicative of the data error to the supervis- 
es ing processor; 

the first HDA initiating a retry process to provide 
an output of the target data; 

the supervising processor initiating a reconstruction
process concurrently with the first recovery
process to provide an output of the target data
by combining supplementary data and
stored parity, said supplementary data comprising
data corresponding to the target data
and stored elsewhere in the RAID storage subsystem
than the first HDA, and said stored parity
comprising parity corresponding to the target
data and supplementary data and stored in the
RAID storage subsystem; and

determining which of the retry and reconstruction
processes first completes;

providing an output of target data from the first 
completing process in response to the Read re- 
quest; and 

aborting the process not completing first.
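
The race between the local retry and the remote reconstruction can be sketched as below. This is an illustrative sketch under stated assumptions, not the claimed implementation: both processes are modeled as Python callables run on threads, abortion is modeled with a shared `threading.Event` that each process is expected to poll, and `race_recovery`, `slow_retry`, and `fast_reconstruct` are hypothetical names with simulated timing.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def race_recovery(retry, reconstruct):
    """Run both recovery processes in parallel; return (winner_name, data)."""
    abort = threading.Event()
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(retry, abort): "retry",
            pool.submit(reconstruct, abort): "reconstruction",
        }
        # Block until at least one of the two processes has completed.
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        # Tell the process that did not complete first to stop.
        abort.set()
        winner = next(iter(done))
        return futures[winner], winner.result()


def slow_retry(abort):
    """Simulated HDA retry: needs about one second unless aborted earlier."""
    for _ in range(50):
        if abort.wait(timeout=0.02):
            return None  # aborted before any data was recovered
    return b"target data"


def fast_reconstruct(abort):
    """Simulated parity reconstruction that completes immediately."""
    return b"target data"


winner_name, data = race_recovery(slow_retry, fast_reconstruct)
```

With the simulated timings above, the reconstruction completes first, so its output answers the Read request and the retry loop exits on its next poll of the abort flag.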

18. A data storage medium tangibly embodying a
machine-readable program of instructions to perform
method steps for operating a storage system including
a supervising processor coupled to at least one
head disk assembly ("HDA") each HDA including an
HDA controller and at least one storage medium,
wherein the processor accesses the at least one
HDA at selected times to exchange data therewith,
said method steps comprising:

a first one of the at least one HDA determining
whether a first predetermined time has elapsed
since a most recent access of the first HDA by
the processor;

if the first predetermined time has elapsed, the
first HDA performing a selected predictive failure
analysis ("PFA") to predict future failure of
the at least one storage medium of the first
HDA; and

if the first predetermined time has not elapsed,
the first HDA refraining from performing any
PFA.
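
The idle-time gating recited in claim 18 can be sketched as follows. This is a minimal illustration under assumptions: time is passed in explicitly as a number of seconds, "has elapsed" is read as "at least the threshold has passed", and the class `IdleGatedPfa` and its method names are hypothetical.

```python
class IdleGatedPfa:
    """Hypothetical HDA-side gate that permits PFA only after an idle period."""

    def __init__(self, idle_threshold):
        # The "first predetermined time", in seconds (assumed unit).
        self.idle_threshold = idle_threshold
        self.last_access = 0.0

    def note_access(self, now):
        """Called whenever the supervising processor accesses this HDA."""
        self.last_access = now

    def may_run_pfa(self, now):
        """True only if the HDA has been idle for at least the threshold."""
        return now - self.last_access >= self.idle_threshold


gate = IdleGatedPfa(idle_threshold=5.0)
gate.note_access(now=100.0)
busy = gate.may_run_pfa(now=103.0)   # only 3 s idle: refrain from PFA
idle = gate.may_run_pfa(now=105.0)   # 5 s idle: PFA may proceed
```

Any new access resets the idle clock, so in this sketch PFA work never starts while the HDA has been accessed within the threshold period.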

19. A data storage medium tangibly embodying a
machine-readable program of instructions to perform
method steps for data recovery in a storage system
including a supervising processor coupled to a
parity-equipped RAID storage subsystem having multiple
head disk assemblies ("HDA") each HDA including
an HDA controller and at least one storage
medium, said method steps comprising:

the supervising processor receiving a Read re- 
quest for reading target data; 

the supervising processor directing a first HDA
to read target data;

the first HDA attempting to read the target data 
and detecting a data error during the attempt; 

the supervising processor executing a recovery
process to reconstruct the target data by combining
supplementary data and stored parity,
said supplementary data comprising data corresponding
to the target data and stored elsewhere
in the RAID storage system than the first
HDA, and said stored parity comprising parity
corresponding to the target data and supplementary
data and stored in the RAID storage
subsystem;

concurrently with the supervising processor executing
the recovery process, the first HDA performing
a selected predictive failure analysis
("PFA") to predict future failure of the at least
one storage medium of the first HDA; and

after completion of the recovery process, pro- 
viding reconstructed target data from the super- 
vising processor to the host. 
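
The reconstruction step, combining supplementary data with stored parity, can be illustrated with a short sketch. The claim does not say how parity is computed; the example assumes bytewise XOR parity over equal-length blocks, as is common in parity-equipped RAID arrays, and the function names `xor_blocks` and `reconstruct_target` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_target(supplementary, parity):
    """Recover the unreadable block from the surviving blocks plus parity."""
    return xor_blocks(list(supplementary) + [parity])


# Three data blocks striped across three HDAs, with parity stored on a fourth.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks([d0, d1, d2])

# Suppose the HDA holding d1 reports a data error: rebuild d1 from the rest.
rebuilt = reconstruct_target([d0, d2], parity)
```

Because XOR is its own inverse, combining the parity with every surviving block cancels their contributions and leaves only the missing block, so `rebuilt` equals `d1` here.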

20. A data storage medium tangibly embodying a
machine-readable program of instructions to perform
method steps for operating a storage system including
a supervising processor coupled to a storage
subsystem having multiple head disk assemblies
("HDA") each including an HDA controller and at
least one storage medium, said method comprising
the steps of:

the supervising processor receiving notice of 
predetermined types of data access errors oc- 
curring in the storage subsystem; 

the supervising processor recording represen- 
tations of the errors in an error log; and 

for each HDA, the supervising processor per- 
forming steps comprising: 

performing a first predictive failure analysis 
("PFA") to determine whether errors associated
with the HDA have a selected characteristic; 
and 

if the errors associated with the HDA have the
selected characteristic, directing the HDA to
perform a second PFA to predict future failure
of the at least one storage medium of the HDA.